Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits

Authors

Abstract

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if the learning rates are well chosen, then the policy gradient algorithm is a transient Markov chain, and the state of the chain converges on the optimal arm with logarithmic or polylogarithmic regret.
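To make the setting concrete, the following Python sketch implements a policy gradient bandit of this flavor: the vector of sampling probabilities lives on the probability simplex and is updated with a step size that depends on the current state rather than on time, so the iterates form a Markov chain. The replicator-style update and the particular rate eta = 0.5 * (1 - max(p)) are illustrative assumptions, not the exact scheme or rate schedule analyzed in the paper.

import numpy as np

def policy_gradient_bandit(mu, T, seed=0):
    """Minimal sketch: K-armed Bernoulli bandit, sampling probabilities p
    on the simplex, REINFORCE-style (replicator) update with a learning
    rate that depends on the current state p rather than on time, so the
    sequence (p_t) is a Markov chain on the probability simplex."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    p = np.full(K, 1.0 / K)               # start at the simplex centre
    regret = 0.0
    for _ in range(T):
        a = rng.choice(K, p=p)            # draw an arm from the state p
        r = float(rng.random() < mu[a])   # Bernoulli reward
        regret += max(mu) - mu[a]
        eta = 0.5 * (1.0 - p.max())       # state-dependent rate (assumed form)
        g = np.zeros(K)
        g[a] = r / p[a]                   # unbiased estimate of the mean vector
        p = p + eta * p * (g - g @ p)     # replicator step; sum(p) is preserved
        p /= p.sum()                      # guard against floating-point drift
    return p, regret

# Example: the mass should concentrate on the 0.7 arm.
p, reg = policy_gradient_bandit([0.3, 0.5, 0.7], T=50_000)
print(p.round(3), reg)

With this update the step shrinks as p approaches a corner of the simplex, which is one simple way to realize a state-dependent learning rate.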


Similar Articles

Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after T steps achieves Õ(√T) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we sho...
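As an illustration of the restless setting (the environment only, not the algorithm of that paper), each arm below carries its own Markov chain that keeps evolving whether or not it is pulled; the transition matrices and rewards are made up for the example.

import numpy as np

def restless_step(states, P, reward_of_state, action, rng):
    """One step of a restless Markov bandit: every arm's state moves
    according to its own transition matrix, independently of the
    learner's action; only the pulled arm's reward is observed."""
    for k in range(len(states)):
        states[k] = rng.choice(P[k].shape[0], p=P[k][states[k]])
    return reward_of_state[action][states[action]]

# Two arms, each an irreducible two-state chain (assumed numbers).
rng = np.random.default_rng(0)
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.5, 0.5]])]
rewards = np.array([[0.0, 1.0], [0.2, 0.6]])  # reward per (arm, state)
states = [0, 0]
r = restless_step(states, P, rewards, action=0, rng=rng)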


Analysis of Ruin Probability for Insurance Companies Using Markov Chain

In this thesis, we show how the Sparre Andersen insurance risk model can be formulated in terms of Markov chains. Then, using matrix-analytic methods, we compute the probability of ruin, the surplus at the time of ruin, and the deficit at the time of ruin. Our aim in this thesis is considerably more computational and practical than the methods previously proposed for computing this probability. We first sho...
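The matrix-analytic idea can be seen in miniature on a toy discrete-time surplus chain: first-step analysis turns the ruin probability into a linear system. The ±1 random walk below is a stand-in for the Sparre Andersen model, and the truncation level N is an assumption of the sketch, not part of the thesis.

import numpy as np

def ruin_probability(q, u0, N=500):
    """Toy surplus chain: each period the insurer earns premium 1 and,
    with probability q, pays a claim of size 2, so the surplus moves
    +1 w.p. 1-q and -1 w.p. q.  Ruin = surplus going below 0.  First-step
    analysis gives psi(u) = q*psi(u-1) + (1-q)*psi(u+1) with psi(-1) = 1;
    we truncate at N (no ruin from there on) and solve the linear system."""
    A = np.eye(N)
    b = np.zeros(N)
    for u in range(N):
        if u == 0:
            b[u] = q                    # one step down from 0 is ruin
        else:
            A[u, u - 1] = -q
        if u + 1 < N:
            A[u, u + 1] = -(1.0 - q)
    return np.linalg.solve(A, b)[u0]

# For q < 1/2 the exact answer is (q/(1-q))**(u0+1): q=0.4, u0=0 gives 2/3.
print(ruin_probability(0.4, 0))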


Hidden Markov model multiarm bandits: a methodology for beam scheduling in multitarget tracking

In this paper, we derive optimal and suboptimal beam scheduling algorithms for electronically scanned array tracking systems. We formulate the scheduling problem as a multiarm bandit problem involving hidden Markov models (HMMs). A finite-dimensional optimal solution to this multiarm bandit problem is presented. The key to solving any multiarm bandit problem is to compute the Gittins index. We ...
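For the fully observed finite-state case, the Gittins index can be computed with the Katehakis–Veinott restart-in-state formulation; the partially observed HMM case treated in that paper is substantially harder. A sketch with made-up chain parameters:

import numpy as np

def gittins_index(P, r, s, beta=0.9, iters=5000):
    """Gittins index of state s of a finite Markov reward chain, via the
    Katehakis-Veinott 'restart-in-s' MDP: in every state, either continue
    the chain or restart from s; the index equals (1 - beta) * V(s)."""
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + beta * P @ V           # keep following the chain
        restart = r[s] + beta * P[s] @ V  # teleport back to state s
        V = np.maximum(cont, restart)     # value iteration on the MDP
    return (1.0 - beta) * V[s]

# Two-state chain with assumed transitions and rewards.
P = np.array([[0.7, 0.3], [0.4, 0.6]])
r = np.array([1.0, 0.0])
print(gittins_index(P, r, s=0))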


Correction to "Hidden Markov model multiarm bandits: a methodology for beam scheduling in multitarget tracking"

We have discovered an error in the return-to-state formulation of the HMM multi-armed bandit problem in our recently published paper [4]. This note briefly outlines the error in [4] and describes a computationally simpler solution. Complete details including proofs of this simpler solution appear in the already submitted paper [3]. The error in [4] is in the return-to-state argument given in Re...


Tight Policy Regret Bounds for Improving and Decaying Bandits

We consider a variant of the well-studied multiarmed bandit problem in which the reward from each action evolves monotonically in the number of times the decision maker chooses to take that action. We are motivated by settings in which we must give a series of homogeneous tasks to a finite set of arms (workers) whose performance may improve (due to learning) or decay (due to loss of interest) w...
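A toy model of the reward dynamics described above, with a hypothetical linear functional form: the expected reward of an arm is a monotone function of how many times it has been pulled, improving for positive slope and decaying for negative slope.

def expected_reward(base, slope, pulls):
    """Toy improving/decaying arm: expected reward after `pulls` previous
    pulls moves monotonically with use (slope > 0: the worker learns;
    slope < 0: the worker loses interest), clipped to [0, 1]."""
    return min(1.0, max(0.0, base + slope * pulls))

# An improving arm overtakes an initially better decaying arm.
print([round(expected_reward(0.2, 0.01, t), 2) for t in (0, 50, 100)])
print([round(expected_reward(0.6, -0.01, t), 2) for t in (0, 50, 100)])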



Journal

Journal title: Mathematics of Operations Research

Year: 2022

ISSN: 0364-765X, 1526-5471

DOI: https://doi.org/10.1287/moor.2022.1311